Assignment 1

I’m here to learn new skills and improve old ones. The course fits my needs perfectly. I have used R a lot, but I haven’t shared my code outside of our research group, so learning good practices is very welcome. I went through the Exercise 1 ‘warm up’ pretty fast and everything was quite familiar. As a note to myself, the syntax and the functions used were in many ways different from what I am used to for reaching the same goal. It might be good to start using fewer ‘intermediate’ objects and more ‘tidyverse’ style. I like to use Atom for writing scripts, so it will be nice to try to do things differently.

My github: https://github.com/jvehvi/IODS-project

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Mon Nov 28 20:21:31 2022"

Assignment 2 - Regression and model validation

This week’s topic was regression and model validation. Here we present basic R commands for building summary tables and linear models, which can be used to study relationships between variables. (Note that a regression model reveals associations; establishing cause and effect requires more than a fitted model.)

Data wrangling

PART 1

# Set up working directory
setwd("E:/Open science course 2022/IODS-project") 
dir.create(paste0(getwd(),"/Data"))
## Warning in dir.create(paste0(getwd(), "/Data")): 'E:\Open science course
## 2022\IODS-project\Data' already exists
setwd(paste0(getwd(),"/Data"))

PART 2

Bring the .txt file, previously downloaded to the Data folder, into the R environment.

learning_2014<-as.data.frame(read.table("Data/learning2014.txt", header = TRUE, sep = "\t"))

The table should have 183 rows and 60 columns, all containing integer values except the last column, which records gender as a character.

PART 3

  • Subset the data set to the columns: gender, age, attitude, deep, stra, surf and points.
  • Also remove rows which have 0 in Points.
  • Deep is calculated as the mean of the columns c(“D03”,“D11”,“D19”,“D27”,“D03”,“D11”,“D19”,“D27”,“D06”,“D15”,“D23”,“D31”), with 0 values excluded.
  • Stra is calculated as the mean of the columns c(“ST01”,“ST09”,“ST17”,“ST25”,“ST04”,“ST12”,“ST20”,“ST28”), with 0 values excluded.
  • Surf is calculated as the mean of the columns c(“SU02”,“SU10”,“SU18”,“SU26”,“SU05”,“SU13”,“SU21”,“SU29”,“SU08”,“SU16”,“SU24”,“SU32”), with 0 values excluded.
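The zero-exclusion step in the recipe above can be sketched on toy data (synthetic values, not the course data): replace 0 with NA, then average with na.rm = TRUE so the zeros do not drag the means down.

```r
# Toy illustration of the averaging-with-zeros-excluded recipe (synthetic values)
m <- data.frame(D03 = c(4, 0), D11 = c(2, 3))
m[m == 0] <- NA               # treat 0 as missing
rowMeans(m, na.rm = TRUE)     # both row means are 3
```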
# Remove rows which have 0 in Points
analy_dataset<-subset(learning_2014, as.numeric(Points)!=0)
analy_dataset[analy_dataset==0]<-NA

# Build a new data frame (DF) for the needed data. Gender, Age and Points are copied directly; Deep, Stra and Surf are initialized to NA for now.
df<-data.frame(Gender=analy_dataset$gender,Age=analy_dataset$Age,Attitude=NA, Deep=NA,Stra=NA, Surf=NA, Points=analy_dataset$Points)

# Add Deep to DF by taking the mean

deep <- c("D03","D11","D19","D27","D03","D11","D19","D27","D06","D15","D23","D31")
df$Deep<-rowMeans(analy_dataset[,colnames(analy_dataset) %in% deep], na.rm=TRUE)

# Add Stra to DF by taking the mean
stra <- c("ST01","ST09","ST17","ST25","ST04","ST12","ST20","ST28")
df$Stra<-rowMeans(analy_dataset[,colnames(analy_dataset) %in% stra], na.rm=TRUE)

# Add Surf to DF by taking the mean
surf <- c("SU02","SU10","SU18","SU26","SU05","SU13","SU21","SU29","SU08","SU16","SU24","SU32")
df$Surf<-rowMeans(analy_dataset[,colnames(analy_dataset) %in% surf], na.rm=TRUE)

# Add Attitude to DF by taking the mean
attitude <- c("Da","Db","Dc","Dd","De","Df","Dg","Dh","Di","Dj")
df$Attitude<-rowMeans(analy_dataset[,colnames(analy_dataset) %in% attitude], na.rm=TRUE)

PART 4

# Change working dir
setwd("E:/Open science course 2022/IODS-project")
# Write results to .csv format file
write.csv2(df,"Data/learning2014.csv")
# Read the saved .csv file. header = TRUE means the file has headers, sep = ";" means values are separated by semicolons, and dec = "," means commas are used for decimals
learning_2014.2<-as.data.frame(read.table("Data/learning2014.csv", header = TRUE, sep = ";", dec=","))

Analysis

PART 1 - Read file to R environment

The data set contains summary results of the survey for the course ‘Johdatus yhteiskuntatilastotieteeseen, syksy 2014’. There should be 7 variables (gender, age, attitude, deep, stra, surf and points) and 166 observations.

  • Gender: Male = 1, Female = 2
  • Age: Age (in years) derived from the date of birth
  • Attitude: Global attitude toward statistics. Mean of original variables (~Da+Db+Dc+Dd+De+Df+Dg+Dh+Di+Dj)
  • Deep: Deep approach. Mean of original variables (~d_sm+d_ri+d_ue)
  • Stra: Strategic approach. Mean of original variables (~st_os+st_tm)
  • Surf: Surface approach. Mean of original variables (~su_lp+su_um+su_sb)
  • Points: Total points from the survey.

More information about the variables can be found at http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS2-meta.txt

# Read file to R environment
learning_2014.2<-as.data.frame(read.table("Data/learning2014.csv", header = TRUE, sep = ";", dec=","))
# Check dimensions of read table
dim(learning_2014.2)
## [1] 166   8

PART 2

With the command summary we can take a look ‘inside’ the data: it gives ‘boxplot-style’ information about the numerical variables in the data frame. A scatterplot matrix is used to describe relationships between the variables. It is constructed from the data frame with the ggpairs function (GGally package, built on ggplot2). The resulting plot shows, in addition to the pairwise relationships, each variable’s distribution, and gives correlation coefficients with asterisks indicating the level of significance.

# Summary
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
summary(learning_2014.2[,2:ncol(learning_2014.2)])
##     Gender               Age           Attitude          Deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.375  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.250  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.750  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.645  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.000  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.875  
##       Stra            Surf           Points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00
# Scatterplot
p <- ggpairs(learning_2014.2[-(1:2)], mapping = aes(), lower =list(combo = wrap("facethist", bins = 20))) 
print(p)

The most promising relationships seem to be between Attitude and Points, and between Surf and Deep.
There also seems to be some kind of relationship between Stra and Surf. Overall, the matrix gives a good overview of the data and a starting point for studying relationships between the variables.

PART 3 and PART 4

Create a regression model with multiple explanatory variables

my_model2 <- lm(Points ~ Attitude + Stra + Surf, data = learning_2014.2)
summary(my_model2)
## 
## Call:
## lm(formula = Points ~ Attitude + Stra + Surf, data = learning_2014.2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## Attitude      3.3952     0.5741   5.913 1.93e-08 ***
## Stra          0.8531     0.5416   1.575  0.11716    
## Surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The summary of the regression model shows that only Attitude correlates significantly with Points. The printout shows the model residuals and a coefficients table (estimate, std. error, t value and p-value for each variable in the model against Points). The significance of a variable can be read from its p-value (last column of the Coefficients table); the significance level thresholds are given under the table. The summary also gives a p-value for the whole model (3.156e-08), which is clearly significant even though the model contains variables that have no apparent relationship to Points. In the next step, let’s remove the non-significant variables from the model and see what happens.

# Only Attitude seems to be significant, so let's fit the model again with it as the only predictor
my_model3 <- lm(Points ~ Attitude, data = learning_2014.2)
summary(my_model3)
## 
## Call:
## lm(formula = Points ~ Attitude, data = learning_2014.2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## Attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Now the results look better and the p-value is significant both for the variable and for the model. Multiple R-squared is the proportion of the variation in the dependent variable that can be explained by the independent variables. So in the model with three variables, 20.74 % of the variation in Points was explained. Interestingly, Attitude by itself explains 19.06 % of the variation, showing that the effect of Stra and Surf on Points is quite minimal.
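The R-squared figures quoted above come from the identity R² = 1 − SSE/SST, which can be verified on toy data (hypothetical values, not the course data):

```r
# Sketch: R-squared computed by hand equals the one lm() reports
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)
sse <- sum(resid(fit)^2)          # residual (unexplained) sum of squares
sst <- sum((y - mean(y))^2)       # total sum of squares
1 - sse / sst                     # equals summary(fit)$r.squared
```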

PART 5 - Study more our model with diagnostic plots

plot(my_model3, which=c(1,2,5))

From the ‘Residuals vs Leverage’ plot we can check which, and how many, observations are influential. In our case the data seems good and there isn’t any point outside the Cook’s distance lines.

The ‘Residuals vs Fitted’ plot also looks good: the points are spread evenly along both axes.

The Q-Q plot also indicates the goodness of our model. If the points depart from the line too early, there might be other variables affecting the relationship more than the Attitude variable does. In this case the Q-Q plot looks really nice, although not perfect.
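As a sketch of what the ‘Residuals vs Leverage’ panel summarizes, Cook’s distances can also be computed directly; with the course data one would call cooks.distance(my_model3), here toy data is used:

```r
# Sketch: Cook's distance values behind the 'Residuals vs Leverage' plot (toy data)
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20)
fit <- lm(y ~ x)
cd  <- cooks.distance(fit)       # one influence value per observation
which(cd > 4 / length(cd))       # a common rule of thumb for flagging influential points
```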


Assignment 3 - Logistic regression

Bring Data to R

For the study, material from the UCI Machine Learning Repository, Student Performance Data (incl. Alcohol consumption) page (https://archive.ics.uci.edu/ml/datasets/Student+Performance) has been used as the source data. The joined data set used in the analysis combines the two student alcohol consumption data sets from the web page. The variables not used for joining the two data sets have been combined by averaging (including the grade variables):

  • ‘alc_use’ is the average of ‘Dalc’ and ‘Walc’
  • ‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise

More information about the variables can be found at: https://archive.ics.uci.edu/ml/datasets/Student+Performance
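The two derived variables described above can be sketched on a few synthetic rows (not the actual student data):

```r
# Toy sketch of the derived variables (synthetic values)
Dalc <- c(1, 3, 5)               # workday alcohol consumption
Walc <- c(2, 4, 5)               # weekend alcohol consumption
alc_use  <- (Dalc + Walc) / 2    # average of the two
high_use <- alc_use > 2          # TRUE above the chosen cutoff
data.frame(alc_use, high_use)    # rows: 1.5/FALSE, 3.5/TRUE, 5.0/TRUE
```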

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.1     ✔ stringr 1.4.1
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ✔ purrr   0.3.5
## Warning: package 'tidyr' was built under R version 4.2.2
## Warning: package 'readr' was built under R version 4.2.2
## Warning: package 'purrr' was built under R version 4.2.2
## Warning: package 'stringr' was built under R version 4.2.2
## Warning: package 'forcats' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
# Read data table to R - environment
wrangled_student_mat_por <- as.data.frame(read_csv("Data/wrangled_student_mat_por.csv"))
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl  (1): high_use
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(wrangled_student_mat_por)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Select interesting variables

For studying the variables’ relationships to alcohol use, I have selected four variables that I hypothesize to be related to a person’s alcohol consumption.

  • age - I think that persons over 18 use more alcohol.

  • romantic - I think that single persons drink more often.

  • studytime - I think persons who study more weekly drink less.

  • freetime - I think persons who have more free time might drink more.

library(ggplot2)
library(GGally)

# Summary table
int_variables<-c("age","romantic","studytime","freetime","alc_use","high_use")
summary(wrangled_student_mat_por[int_variables])
##       age          romantic           studytime        freetime    
##  Min.   :15.00   Length:370         Min.   :1.000   Min.   :1.000  
##  1st Qu.:16.00   Class :character   1st Qu.:1.000   1st Qu.:3.000  
##  Median :17.00   Mode  :character   Median :2.000   Median :3.000  
##  Mean   :16.58                      Mean   :2.043   Mean   :3.224  
##  3rd Qu.:17.00                      3rd Qu.:2.000   3rd Qu.:4.000  
##  Max.   :22.00                      Max.   :4.000   Max.   :5.000  
##     alc_use       high_use      
##  Min.   :1.000   Mode :logical  
##  1st Qu.:1.000   FALSE:259      
##  Median :1.500   TRUE :111      
##  Mean   :1.889                  
##  3rd Qu.:2.500                  
##  Max.   :5.000
# Unique values for romantic column
unique(wrangled_student_mat_por$romantic)
## [1] "no"  "yes"
# Group by romantic before summarise
wrangled_student_mat_por %>% group_by(romantic) %>% summarise(count = n())
## # A tibble: 2 × 2
##   romantic count
##   <chr>    <int>
## 1 no         251
## 2 yes        119
# Scatterplot
p <- ggpairs(wrangled_student_mat_por[int_variables], mapping = aes(), lower =list(combo = wrap("facethist", bins = 20))) 
p

The results show that there seems to be some kind of relationship between alcohol use and at least freetime, studytime and age. A positive correlation can be seen between alcohol use and freetime, and negative correlations between alcohol use and studytime, and between alcohol use and age. An interesting finding is that young persons (15-18) seem to drink more than those aged 19-21.

Logistic regression

Use glm() to fit a logistic regression model with high_use as the target variable and freetime, studytime, age and romantic as the predictors.

m <- glm(high_use ~ freetime + studytime + age + factor(romantic), data = wrangled_student_mat_por, family = "binomial")

# Summary
summary(m)
## 
## Call:
## glm(formula = high_use ~ freetime + studytime + age + factor(romantic), 
##     family = "binomial", data = wrangled_student_mat_por)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6628  -0.8579  -0.6654   1.1899   2.3332  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -4.4470     1.7762  -2.504 0.012293 *  
## freetime              0.3269     0.1238   2.641 0.008274 ** 
## studytime            -0.5850     0.1572  -3.721 0.000198 ***
## age                   0.2246     0.1025   2.190 0.028512 *  
## factor(romantic)yes  -0.2158     0.2594  -0.832 0.405404    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 421.66  on 365  degrees of freedom
## AIC: 431.66
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                             OR        2.5 %    97.5 %
## (Intercept)         0.01171392 0.0003399901 0.3664590
## freetime            1.38664482 1.0914115708 1.7751196
## studytime           0.55710460 0.4053055149 0.7516885
## age                 1.25176377 1.0261046822 1.5355541
## factor(romantic)yes 0.80586877 0.4806650258 1.3321754

The results show that the widest confidence interval is for the variable ‘romantic’. The interval also contains 1, so most probably alcohol consumption and romantic status have no relationship; rather, the variables are independent of each other. Freetime and age have an OR and a confidence interval > 1, so there seems to be a positive association between those and high_use. The studytime values are < 1, indicating a negative association between it and high_use.
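As a sketch of where the OR column comes from: exponentiating a logit coefficient gives the multiplicative change in the odds, here using the (rounded) freetime estimate from the summary above.

```r
# Sketch: exponentiating a logit coefficient gives an odds ratio
b <- 0.3246   # freetime estimate from the summary above (rounded)
exp(b)        # ~1.38: one extra freetime point multiplies the odds of high_use by ~1.38
```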

Predictive power of the model

Next, I have built a new model excluding ‘romantic’, which seems to have no relationship to students’ alcohol use. After building the model, I have printed its summary and calculated the mean prediction error.

library(caret)
## Warning: package 'caret' was built under R version 4.2.2
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
# Let's select only freetime, studytime and age for the next model
m <- glm(high_use ~ freetime + studytime + age, data = wrangled_student_mat_por, family = "binomial")

# Summary
summary(m)
## 
## Call:
## glm(formula = high_use ~ freetime + studytime + age, family = "binomial", 
##     data = wrangled_student_mat_por)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6143  -0.8690  -0.6895   1.2088   2.2716  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.2829     1.7668  -2.424 0.015345 *  
## freetime      0.3246     0.1236   2.626 0.008630 ** 
## studytime    -0.5928     0.1575  -3.765 0.000167 ***
## age           0.2119     0.1014   2.089 0.036700 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 422.36  on 366  degrees of freedom
## AIC: 430.36
## 
## Number of Fisher Scoring iterations: 4
# Making predictions on the train set.
# Probability
wrangled_student_mat_por <- mutate(wrangled_student_mat_por, probability = predict(m, type = "response"))
# Prediction
wrangled_student_mat_por <- mutate(wrangled_student_mat_por, prediction = probability > 0.5)
# 2x2 cross tabulation of predictions versus the actual values
table(high_use = wrangled_student_mat_por$high_use, prediction = wrangled_student_mat_por$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   245   14
##    TRUE     96   15
# Making predictions again via predict(); note that here the "test set" is the same data, so the table below matches the one above
test_pred<- ifelse(predict(m, newdata = wrangled_student_mat_por, type = "response") > 0.5, "TRUE", "FALSE")
# 2x2 cross tabulation of predictions versus the actual values
table(high_use = wrangled_student_mat_por$high_use, prediction = test_pred)
##         prediction
## high_use FALSE TRUE
##    FALSE   245   14
##    TRUE     96   15
# We can study the model's goodness further by looking at the confusionMatrix of the results
confusionMatrix(table(high_use = wrangled_student_mat_por$high_use, prediction = test_pred))
## Confusion Matrix and Statistics
## 
##         prediction
## high_use FALSE TRUE
##    FALSE   245   14
##    TRUE     96   15
##                                           
##                Accuracy : 0.7027          
##                  95% CI : (0.6533, 0.7488)
##     No Information Rate : 0.9216          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1028          
##                                           
##  Mcnemar's Test P-Value : 1.136e-14       
##                                           
##             Sensitivity : 0.7185          
##             Specificity : 0.5172          
##          Pos Pred Value : 0.9459          
##          Neg Pred Value : 0.1351          
##              Prevalence : 0.9216          
##          Detection Rate : 0.6622          
##    Detection Prevalence : 0.7000          
##       Balanced Accuracy : 0.6179          
##                                           
##        'Positive' Class : FALSE           
## 
# Prediction plot
# access dplyr and ggplot2
library(dplyr); library(ggplot2)

# initialize a plot of 'high_use' versus 'probability' in 'wrangled_student_mat_por'
# define the geom as points and draw the plot
g <- ggplot(wrangled_student_mat_por, aes(x = high_use, y =as.numeric(probability))) +
    geom_point(aes(y = as.numeric(probability)), alpha = 0.2)

# ROC curve
library('pROC')
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
test_prob <- predict(m, newdata = wrangled_student_mat_por, type = "response")
test_roc <- roc(wrangled_student_mat_por$high_use ~ test_prob, plot = TRUE, print.auc = TRUE)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases

test_roc
## 
## Call:
## roc.formula(formula = wrangled_student_mat_por$high_use ~ test_prob,     plot = TRUE, print.auc = TRUE)
## 
## Data: test_prob in 259 controls (wrangled_student_mat_por$high_use FALSE) < 111 cases (wrangled_student_mat_por$high_use TRUE).
## Area under the curve: 0.6808
# Define a loss function (mean prediction error)
calc_class_err <- function(actual, predicted) {
  mean(actual != predicted)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
calc_class_err(actual = wrangled_student_mat_por$high_use, predicted = wrangled_student_mat_por$prediction)
## [1] 0.2972973
# call loss_func to compute the average number of wrong predictions in the (test) data
# Error rates should be close to each other for the training and testing data
calc_class_err(actual = wrangled_student_mat_por$high_use, predicted = test_pred)
## [1] 0.2972973

Our results show that approximately 30 % of our predictions are wrong, so the accuracy of our model is about 0.7, which is at least better than a pure guess. We selected 0.5 as the classification threshold; that value is our choice, and its effect on accuracy can be studied from the ROC curve. In this study it is okay. The test error rate is the same as the training error rate (as expected here, since we predicted on the same data), so our model seems to be fitted properly. When the predictions are compared to high_use, we see that our predictions contain too many low-consumption (alc_use ≤ 2) students compared to the real situation, and correspondingly too few high users.
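The effect of moving the 0.5 threshold can be sketched on synthetic probabilities (with the course data, the probability and high_use columns would be used instead):

```r
# Sketch: how the classification error depends on the chosen threshold (synthetic data)
set.seed(42)
prob   <- runif(200)                            # synthetic predicted probabilities
actual <- (prob + rnorm(200, sd = 0.3)) > 0.5   # synthetic noisy truth
thresholds <- seq(0.1, 0.9, by = 0.1)
errs <- sapply(thresholds, function(t) mean((prob > t) != actual))
round(errs, 2)   # error rate at each threshold; typically lowest near the middle here
```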

10-fold cross-validation

Let’s perform 10-fold cross-validation for our model to see whether we get better test set performance. 10-fold means that the observations are divided into 10 bins; each bin serves once as the validation set while the model is trained on the other nine.
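The fold assignment itself can be sketched in base R (caret does this internally, so this is only an illustration of the mechanics):

```r
# Minimal sketch of the 10-fold idea: each observation lands in exactly one fold
set.seed(123)                               # arbitrary seed for reproducibility
n <- 370                                    # number of observations, as in our data
folds <- sample(rep(1:10, length.out = n))  # random fold label per observation
table(folds)                                # 37 observations in each of the 10 folds
```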

library(caret)

#specify the cross-validation method
ctrl <- trainControl(method = "cv", number = 10)

#fit a regression model and use k-fold CV to evaluate performance
model <- train(factor(high_use) ~ freetime + studytime + age, data = wrangled_student_mat_por, family = "binomial", method = "glm", trControl = ctrl)
model
## Generalized Linear Model 
## 
## 370 samples
##   3 predictor
##   2 classes: 'FALSE', 'TRUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 333, 332, 333, 333, 333, 333, ... 
## Resampling results:
## 
##   Accuracy  Kappa     
##   0.697321  0.07746201
# Show also summary of model
summary(model)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6143  -0.8690  -0.6895   1.2088   2.2716  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.2829     1.7668  -2.424 0.015345 *  
## freetime      0.3246     0.1236   2.626 0.008630 ** 
## studytime    -0.5928     0.1575  -3.765 0.000167 ***
## age           0.2119     0.1014   2.089 0.036700 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 422.36  on 366  degrees of freedom
## AIC: 430.36
## 
## Number of Fisher Scoring iterations: 4
#View final model parameters
model$finalModel
## 
## Call:  NULL
## 
## Coefficients:
## (Intercept)     freetime    studytime          age  
##     -4.2829       0.3246      -0.5928       0.2119  
## 
## Degrees of Freedom: 369 Total (i.e. Null);  366 Residual
## Null Deviance:       452 
## Residual Deviance: 422.4     AIC: 430.4

The results show that our accuracy is approximately 0.70, meaning that 30 % of our predictions are wrong. The results are almost the same as with the plain logistic regression model, so in this case there isn’t much point in using 10-fold cross-validation.

Typically, as k gets larger, the difference in size between the training set and the resampling subsets gets smaller. As this difference decreases, the bias of the technique becomes smaller, but there is a bias-variance trade-off associated with the choice of k in k-fold cross-validation. The typical choices are k = 10 or k = 5, because these have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

According to the summary table of the model, studytime has the strongest negative effect on alcohol use, freetime has the strongest positive association with high_use, and the age variable also has a significant effect on drinking habits. Basically, the results seem to be in line with the earlier results.

Perform cross-validation to compare the performance of different logistic regression models

For testing and comparing the performance of different models, I have selected 10 variables as a starting point and then removed variables one by one (in column order) until only one predictor was left. The selected variables are:

  • famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)

  • Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

  • Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

  • traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)

  • famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

  • sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)

  • health - current health status (numeric: from 1 - very bad to 5 - very good)

  • freetime - free time after school (numeric: from 1 - very low to 5 - very high)

  • studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

  • age - student’s age (numeric: from 15 to 22)

The data has been divided into test and training sets, so that we can better assess how the models generalize.

# select only columns under interest
wrangled_student_mat_por_sub <- wrangled_student_mat_por[,colnames(wrangled_student_mat_por) %in% c("famsize","Medu","Fedu","traveltime","famrel","sex","health","freetime","studytime","age","high_use")]

# Divide data to test and training sets by using 70 % of the data for the training set
dat <- list()
dat[['index']] <- sample(nrow(wrangled_student_mat_por_sub), size = nrow(wrangled_student_mat_por_sub) * 0.7)
dat[['training']] <- wrangled_student_mat_por_sub[dat$index, ]
dat[['test']] <- wrangled_student_mat_por_sub[-dat$index,]

# Empty lists
count_of_variables <- test_mses <- training_mses <- list()
# For loop to test different counts of variables
for (i in 1:(ncol(wrangled_student_mat_por_sub)- 1)) {
    
    # Build a new model on each iteration, containing one variable less.
    # family = "binomial" is needed for logistic (rather than linear) regression
    train_mod <- glm(high_use ~ ., data = dat$training[, i:ncol(dat$training)], family = "binomial")
    
    # Predict values for training data
    y_hat_training <- predict(train_mod, type = "response")
    train_predict = y_hat_training > 0.5
    # Collect training error
    training_mses[[i]] <- calc_class_err(actual = dat$training$high_use, predicted = train_predict)
    
    # Predict values for the testing data (type = "response" gives probabilities)
    y_hat_test <- predict(train_mod, newdata = dat$test, type = "response")
    test_predict = y_hat_test > 0.5
    test_mses[[i]] <- calc_class_err(actual = dat$test$high_use, predicted = test_predict)
    
    count_of_variables[[i]] <- length(train_mod$coefficients) - 1
}

# Build dataframe from results to be used in ggplot
df<-data.frame(Number_of_Variables=unlist(count_of_variables), Test_error=unlist(test_mses),
                Training_error=unlist(training_mses))
# Build plot
library("reshape2")
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
df_long<-melt(df, id="Number_of_Variables",)
df_long$Number_of_Variables<-factor(df_long$Number_of_Variables, levels=df$Number_of_Variables)
a<-ggplot(df_long, aes(x = Number_of_Variables,y=value, color=variable, group=variable)) +
    geom_point() +
    geom_line() +
    geom_smooth() +
    scale_x_discrete(limits = rev(levels(df_long$Number_of_Variables)))
a
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

In the plot, the test and training errors are shown as separate lines against the number of prediction variables in the model. I selected ten variables which might have an effect on alcohol consumption. Typically, a training error smaller than the test error indicates that the model overfits the training data; the model can be considered good enough when the test and training errors are near each other. E.g. from the plot it can be seen that the training error first rises in the range 10-8, while in the same range the test error is lower and remains almost unchanged. Overfitting can be seen in the ranges 7-5 and 3-1. When we have four variables in the model, the test and training errors are near each other, indicating the model is adequate. These results seem to change depending on how the data is divided into test and training sets. According to these results, a logistic regression model with four variables seems good, provided the variables are well chosen. Different groupings of the selected variables should also be tried to find out which four variables would be the best ones.
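Since the split is random, fixing the seed makes the error curves reproducible from run to run; a minimal sketch on the built-in mtcars data (the seed value is arbitrary):

```r
# Sketch: a reproducible 70/30 train/test split (mtcars stands in for the course data)
set.seed(2022)                                        # arbitrary seed
idx   <- sample(nrow(mtcars), size = floor(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]
c(train = nrow(train), test = nrow(test))             # 22 and 10
```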


Assignment 4: Clustering and classification

The topic of this assignment is clustering and classification. In essence, clustering means that some points/observations in the data lie so near each other, compared to the other points/observations, that they form a cluster of their own. There are multiple clustering methods for finding such clusters in data. One of the most common is k-means clustering; others worth naming are the hierarchical clustering methods, which give dendrograms as output. If the clustering succeeds, new observations can then be classified into those clusters.
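A minimal k-means sketch on synthetic 2-D data (not the Boston data) shows the idea:

```r
# Sketch: k-means on two well-separated synthetic point clouds
set.seed(7)
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # 20 points around (0, 0)
             matrix(rnorm(40, mean = 5), ncol = 2))   # 20 points around (5, 5)
km <- kmeans(pts, centers = 2, nstart = 10)
table(km$cluster)   # the two groups of 20 points are recovered
```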

To clarify this topic, Boston dataset from MASS - package will be used as a demonstration dataset.

The Boston Dataset

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The Boston data frame has 506 rows and 14 columns; the column contents can be seen from the summary below.

Summary and graphical overview of the data

## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## corrplot 0.92 loaded
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
## [1] 506  14
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax    ptratio
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867
##               black      lstat       medv
## crim    -0.38506394  0.4556215 -0.3883046
## zn       0.17552032 -0.4129946  0.3604453
## indus   -0.35697654  0.6037997 -0.4837252
## chas     0.04878848 -0.0539293  0.1752602
## nox     -0.38005064  0.5908789 -0.4273208
## rm       0.12806864 -0.6138083  0.6953599
## age     -0.27353398  0.6023385 -0.3769546
## dis      0.29151167 -0.4969958  0.2499287
## rad     -0.44441282  0.4886763 -0.3816262
## tax     -0.44180801  0.5439934 -0.4685359
## ptratio -0.17738330  0.3740443 -0.5077867
## black    1.00000000 -0.3660869  0.3334608
## lstat   -0.36608690  1.0000000 -0.7376627
## medv     0.33346082 -0.7376627  1.0000000

There are 506 observations of 14 variables, summarising information about housing in the Boston, MA area. With this dataset, we can look for relationships between the variables. For example, the relationship between the per capita crime rate and the other variables is very interesting: if we could find the variables that affect it the most, we could try to influence them when planning new residential areas and thereby slow the growth of the crime rate.

From the summary table, we can see that not all variables are easy or wise to use as predictors in the same model in their current form. For example, the observations of the "proportion of residential land zoned for lots over 25,000 sq.ft." (zn) variable are very unevenly distributed, whereas the "pupil-teacher ratio by town" (ptratio) observations are much more evenly distributed. Interestingly, when the other variables are plotted against the crime rate, a correlation can still be visible even when the distribution is this skewed.

From the correlation matrix, we can see some degree of positive or negative correlation between the per capita crime rate and each of the other variables: six variables correlate negatively and seven positively. The weakest negative correlation is with the "Charles River dummy variable (1 if tract bounds river; 0 otherwise)" (chas), and overall the impact of the negatively correlated variables seems to be low. The positively correlated variables seem to matter much more: the "index of accessibility to radial highways" (rad) and the "full-value property-tax rate per $10,000" (tax) have the strongest positive correlations with the per capita crime rate. The same results are visualized in the plot, where conclusions can be drawn more easily.
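The correlations with the crime rate discussed above can be extracted directly from the matrix; a small sketch (the variable selection in the pairs plot is just an illustrative pick):

```r
library(MASS)

# correlations of every variable with per capita crime rate, rounded as above
crim_cor <- round(cor(Boston)[, "crim"], 2)
sort(crim_cor)   # chas is the weakest link; rad and tax the strongest positive ones

# crime rate plotted against a few selected variables
pairs(Boston[, c("crim", "rad", "tax", "chas")])
```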

Standardize the dataset

To get a clearer understanding of the data and make it usable in a prediction model, it has to be standardized. To do that, we scale and center the columns of a numeric matrix: centering is done by subtracting the column means (omitting NAs) from the corresponding columns, and scaling is done by dividing the (centered) columns by their standard deviations.
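The two steps can be done by hand on a toy matrix to show that they are exactly what scale() does in one call (a sketch; the toy values are arbitrary):

```r
x <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)

centered     <- sweep(x, 2, colMeans(x), "-")             # subtract column means
standardized <- sweep(centered, 2, apply(x, 2, sd), "/")  # divide by column SDs

# identical to scale(x) up to the attributes scale() attaches
all.equal(standardized, scale(x), check.attributes = FALSE)
```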

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
##         crim         zn      indus       chas        nox        rm        age
## 1 -0.4193669  0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 2 -0.4169267 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824  0.3668034
## 3 -0.4169290 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
## 4 -0.4163384 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878
## 5 -0.4120741 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743
## 6 -0.4166314 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100
##        dis        rad        tax    ptratio     black      lstat       medv
## 1 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990  0.1595278
## 2 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239
## 3 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324  1.3229375
## 4 1.076671 -0.7521778 -1.1050216  0.1129203 0.4157514 -1.3601708  1.1815886
## 5 1.076671 -0.7521778 -1.1050216  0.1129203 0.4406159 -1.0254866  1.4860323
## 6 1.076671 -0.7521778 -1.1050216  0.1129203 0.4101651 -1.0422909  0.6705582
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
## [1] "data.frame"
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax    ptratio
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867
##               black      lstat       medv
## crim    -0.38506394  0.4556215 -0.3883046
## zn       0.17552032 -0.4129946  0.3604453
## indus   -0.35697654  0.6037997 -0.4837252
## chas     0.04878848 -0.0539293  0.1752602
## nox     -0.38005064  0.5908789 -0.4273208
## rm       0.12806864 -0.6138083  0.6953599
## age     -0.27353398  0.6023385 -0.3769546
## dis      0.29151167 -0.4969958  0.2499287
## rad     -0.44441282  0.4886763 -0.3816262
## tax     -0.44180801  0.5439934 -0.4685359
## ptratio -0.17738330  0.3740443 -0.5077867
## black    1.00000000 -0.3660869  0.3334608
## lstat   -0.36608690  1.0000000 -0.7376627
## medv     0.33346082 -0.7376627  1.0000000

After scaling, we can see that the values of the different variables are on a much more comparable scale. The correlation matrix shows the same results as with the raw values, since correlations are unaffected by linear scaling.

##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
##         crim         zn      indus       chas        nox        rm        age
## 1 -0.4193669  0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 2 -0.4169267 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824  0.3668034
## 3 -0.4169290 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
## 4 -0.4163384 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878
## 5 -0.4120741 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743
## 6 -0.4166314 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100
##        dis        rad        tax    ptratio     black      lstat       medv
## 1 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990  0.1595278
## 2 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239
## 3 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324  1.3229375
## 4 1.076671 -0.7521778 -1.1050216  0.1129203 0.4157514 -1.3601708  1.1815886
## 5 1.076671 -0.7521778 -1.1050216  0.1129203 0.4406159 -1.0254866  1.4860323
## 6 1.076671 -0.7521778 -1.1050216  0.1129203 0.4101651 -1.0422909  0.6705582

Divide the dataset to train and test sets

  • The data will be divided 80/20 into training and test sets
# Bring Boston data table from github to R

boston_scaled <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/boston_scaled.txt", sep=",", header = T)

# Change 'crime' to factor variable with relevant levels
boston_scaled$crime <- factor(boston_scaled$crime, levels = c("low", "med_low", "med_high", "high"))

# Build up test and train datasets
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]
dim(train)
## [1] 404  14
# create test set 
test <- boston_scaled[-ind,]
dim(test)
## [1] 102  14
# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)
head(test)
##             zn      indus       chas        nox         rm         age
## 7   0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.3880270 -0.07015919
## 16 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.6413655 -0.42896588
## 17 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.4976172 -1.39525719
## 23 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.2030044  0.82152875
## 26 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.9758293  0.60837625
## 27 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.6712537  0.77179316
##           dis        rad        tax   ptratio      black       lstat
## 7  0.83841422 -0.5224844 -0.5769480 -1.503749  0.4263763 -0.03123671
## 16 0.33411879 -0.6373311 -0.6006817  1.175303  0.4265954 -0.58577611
## 17 0.33411879 -0.6373311 -0.6006817  1.175303  0.3305330 -0.85044265
## 23 0.08636389 -0.6373311 -0.6006817  1.175303  0.4406159  0.84958472
## 26 0.31322322 -0.6373311 -0.6006817  1.175303 -0.5833190  0.54010692
## 27 0.42121530 -0.6373311 -0.6006817  1.175303  0.2213265  0.30204708
##           medv
## 7   0.03992492
## 16 -0.28626471
## 17  0.06167090
## 23 -0.79729513
## 26 -0.93864397
## 27 -0.64507330

Now we have built the training and test data by randomly selecting 80 % of the observations as the training set and using the remaining 20 % as the test set. Both sets will be used in the next part.

Fit the linear discriminant analysis on the train set

  • Crime rate as the target variable and all the other variables in the dataset as predictor variables.

Now we use the training set in linear discriminant analysis. The lda function tries hard to detect whether the within-class covariance matrix is singular: if any variable has within-group variance less than tol^2, it stops and reports the variable as constant. The function returns:

  • prior - the prior probabilities used.
  • means - the group means.
  • scaling - a matrix which transforms observations to discriminant functions, normalized so that the within-groups covariance matrix is spherical.
  • svd - the singular values, which give the ratio of the between- and within-group standard deviations on the linear discriminant variables. Their squares are the canonical F-statistics.
  • N - the number of observations used.
  • call - the (matched) function call.

# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2500000 0.2524752 0.2301980 0.2673267 
## 
## Group means:
##                   zn      indus        chas        nox         rm        age
## low       0.95720988 -0.9974152 -0.15538550 -0.9236781  0.4259313 -0.9151365
## med_low  -0.09293868 -0.3511530 -0.04073494 -0.5975619 -0.1266931 -0.3659605
## med_high -0.39410920  0.2733794  0.19334945  0.3750116  0.1650971  0.4654724
## high     -0.48724019  1.0169921 -0.05360128  1.0579430 -0.4086114  0.7968254
##                 dis        rad        tax     ptratio      black       lstat
## low       0.9225903 -0.6885004 -0.7251658 -0.39837690  0.3884771 -0.77813734
## med_low   0.4378610 -0.5438774 -0.5293644 -0.05372787  0.3233714 -0.14698541
## med_high -0.4035150 -0.4051679 -0.2705798 -0.24915387  0.1170678 -0.07059714
## high     -0.8470258  1.6393984  1.5149640  0.78225547 -0.8112050  0.88739423
##                 medv
## low       0.49831286
## med_low  -0.01849074
## med_high  0.16899781
## high     -0.71896935
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.07641944  0.593281114 -1.02217403
## indus    0.10765110 -0.696940487  0.34908894
## chas    -0.08896789 -0.089544727  0.14830936
## nox      0.45028218 -0.589517652 -1.15508780
## rm      -0.08567328 -0.069019878 -0.14263845
## age      0.22932653 -0.398133937 -0.23362567
## dis     -0.07062677 -0.267905582  0.40947588
## rad      3.07547503  0.796406811  0.32208758
## tax      0.00738399  0.331248826  0.01112788
## ptratio  0.10860697  0.052156452 -0.20610157
## black   -0.13334457  0.006753021  0.12613448
## lstat    0.22046003 -0.014002027  0.55635912
## medv     0.18489459 -0.200969317 -0.15167768
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9448 0.0420 0.0132
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
# plot the lda results and add the arrows for the predictor variables
# (lda.arrows is a base-graphics call, so it is run after plot(), not chained with +)
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)

From the results plot we can see that rad seems to be the most important variable in our training set for explaining LD1. From the result object we can see that it has the largest LD1 coefficient, about 3.08, whereas the second most influential variable seems to be nox with only about 0.45. For LD2, zn has the highest positive coefficient, about 0.59. The proportion of trace shows that LD1 alone captures about 94 % of the between-group variance.

Predict the classes with the LDA model on the test data
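The chunk output below appears without the code that produced it. As a hedged, self-contained reconstruction of the whole pipeline (the random split means the exact numbers will differ from the printout):

```r
library(MASS)

# standardize the Boston data and cut the crime rate into four quantile classes
boston_scaled <- as.data.frame(scale(Boston))
bins <- quantile(boston_scaled$crim)
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins,
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL

# 80/20 train/test split
set.seed(1)
ind <- sample(nrow(boston_scaled), size = 0.8 * nrow(boston_scaled))
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
correct_classes <- test$crime

# fit LDA on the training set, then predict the classes of the test set
lda.fit  <- lda(crime ~ ., data = train)
lda.pred <- predict(lda.fit, newdata = test)

mean(lda.pred$class != correct_classes)                      # misclassification rate
table(correct = correct_classes, predicted = lda.pred$class) # confusion table
```

The predict object holds the predicted classes ($class), the class posterior probabilities ($posterior) and the discriminant scores ($x), which is exactly what is printed below.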

## $class
##   [1] med_low  med_low  med_low  med_low  med_low  med_low  med_low  low     
##   [9] med_low  med_low  med_low  med_low  med_high med_low  med_high med_high
##  [17] med_low  med_high med_high med_high med_high med_high med_high med_high
##  [25] med_high med_high med_high med_high med_high med_low  med_low  med_low 
##  [33] low      low      low      low      low      low      low      low     
##  [41] med_low  med_low  med_low  med_low  med_low  med_low  med_low  med_high
##  [49] med_high med_high med_low  med_high med_low  low      med_low  low     
##  [57] med_high med_high med_high med_high med_low  low      med_low  low     
##  [65] med_low  med_low  med_high med_low  med_low  med_low  med_low  low     
##  [73] med_low  med_low  med_low  low      low      high     high     high    
##  [81] high     high     high     high     high     high     high     high    
##  [89] high     high     high     high     high     high     high     high    
##  [97] high     med_high med_high med_high med_high med_low 
## Levels: low med_low med_high high
## 
## $posterior
##              low      med_low     med_high         high
## 7   1.116775e-01 7.133338e-01 1.749887e-01 2.009488e-15
## 16  2.042848e-01 6.035914e-01 1.921238e-01 2.978725e-16
## 17  3.252279e-01 5.753440e-01 9.942816e-02 5.257400e-17
## 23  5.703718e-02 6.031237e-01 3.398391e-01 1.126166e-14
## 26  6.346462e-02 6.082181e-01 3.283173e-01 1.716595e-14
## 27  6.152680e-02 5.661387e-01 3.723345e-01 7.040839e-15
## 35  4.132069e-02 4.980798e-01 4.605995e-01 1.109314e-13
## 40  9.952841e-01 4.518512e-03 1.974271e-04 4.085546e-21
## 54  4.544407e-01 5.402797e-01 5.279609e-03 1.237658e-19
## 76  9.344795e-02 8.756093e-01 3.094275e-02 4.314846e-17
## 106 7.010030e-02 4.999078e-01 4.299919e-01 5.594781e-13
## 107 6.366763e-02 5.685396e-01 3.677928e-01 6.723147e-13
## 109 9.678023e-02 4.167919e-01 4.864279e-01 1.046626e-13
## 110 8.265270e-02 5.203908e-01 3.969565e-01 1.965004e-13
## 112 6.771302e-02 3.816869e-01 5.506001e-01 3.365796e-13
## 116 3.958863e-02 4.601686e-01 5.002428e-01 3.982344e-12
## 120 7.863042e-02 6.321376e-01 2.892320e-01 4.879579e-13
## 121 3.461152e-03 2.074338e-01 7.891051e-01 5.221521e-17
## 122 2.170660e-03 1.412482e-01 8.565811e-01 7.444952e-17
## 123 1.137547e-03 1.229520e-01 8.759105e-01 2.801245e-16
## 139 1.708092e-03 9.037120e-02 9.079207e-01 9.576204e-13
## 142 5.543856e-04 1.534891e-01 8.459565e-01 6.919590e-11
## 143 1.342218e-05 2.503381e-03 9.974832e-01 3.539774e-11
## 146 6.576913e-06 5.819189e-04 9.994115e-01 3.092960e-09
## 147 1.152300e-05 3.441798e-04 9.996443e-01 4.153527e-10
## 148 1.251540e-05 1.789921e-03 9.981976e-01 1.909610e-09
## 153 4.364065e-05 1.622610e-03 9.983337e-01 1.803712e-12
## 157 1.455603e-05 4.124797e-04 9.995730e-01 6.925407e-10
## 158 8.305436e-04 1.667995e-02 9.824895e-01 1.108524e-13
## 178 2.801299e-01 5.482736e-01 1.715965e-01 8.088377e-16
## 179 2.600974e-01 4.848259e-01 2.550767e-01 1.680102e-15
## 185 1.991209e-01 6.695253e-01 1.313537e-01 8.343942e-17
## 189 9.442562e-01 5.400591e-02 1.737848e-03 2.294184e-18
## 191 8.993926e-01 9.715025e-02 3.457203e-03 2.940146e-18
## 193 9.196202e-01 7.693561e-02 3.444231e-03 1.285632e-18
## 197 9.946295e-01 5.298716e-03 7.178030e-05 6.441128e-24
## 198 9.910682e-01 8.834800e-03 9.698475e-05 3.904672e-23
## 199 9.922042e-01 7.676755e-03 1.189993e-04 2.800242e-23
## 204 9.971909e-01 2.184586e-03 6.244859e-04 2.291078e-19
## 205 9.974525e-01 1.917999e-03 6.295334e-04 1.724477e-19
## 207 1.017639e-01 7.970500e-01 1.011862e-01 4.073316e-17
## 210 7.861840e-03 8.728449e-01 1.192932e-01 6.098211e-16
## 212 9.425262e-03 9.112599e-01 7.931483e-02 3.274059e-16
## 213 3.743648e-02 9.174359e-01 4.512763e-02 7.551064e-18
## 215 2.693103e-02 9.611961e-01 1.187288e-02 5.711630e-16
## 216 1.452504e-01 7.766207e-01 7.812888e-02 2.257245e-17
## 217 1.929481e-02 7.442701e-01 2.364351e-01 4.560535e-16
## 218 1.270115e-02 2.391894e-01 7.481094e-01 1.742492e-14
## 225 3.210085e-02 1.323957e-01 8.355035e-01 1.262223e-11
## 226 1.696488e-02 7.936452e-02 9.036706e-01 1.940899e-11
## 230 2.462753e-01 6.039777e-01 1.497470e-01 4.286857e-13
## 233 4.375921e-02 1.720882e-01 7.841526e-01 3.290605e-12
## 236 6.963796e-02 6.504862e-01 2.798759e-01 1.345188e-11
## 252 5.060448e-01 4.841532e-01 9.801980e-03 2.732455e-16
## 254 3.817954e-01 5.742306e-01 4.397404e-02 5.112891e-16
## 257 9.960235e-01 3.612501e-03 3.639795e-04 4.737757e-21
## 259 4.093459e-02 2.190876e-02 9.371567e-01 1.868553e-13
## 262 3.654754e-02 1.952215e-02 9.439303e-01 1.722002e-13
## 264 6.195266e-02 4.363880e-02 8.944085e-01 1.754541e-13
## 266 3.302785e-01 1.717248e-01 4.979967e-01 8.387289e-14
## 270 2.827773e-01 6.922895e-01 2.493319e-02 2.080806e-19
## 280 6.678464e-01 3.155085e-01 1.664508e-02 6.195425e-18
## 283 3.814058e-01 5.454876e-01 7.310664e-02 1.916194e-18
## 292 9.941298e-01 5.178059e-03 6.921040e-04 3.044488e-19
## 294 6.570856e-02 9.244183e-01 9.873107e-03 8.605026e-20
## 297 4.372992e-02 9.139900e-01 4.228006e-02 6.444885e-19
## 313 7.439606e-02 4.617269e-01 4.638770e-01 2.903775e-15
## 321 1.996487e-01 6.829992e-01 1.173521e-01 3.522681e-16
## 324 9.137122e-02 7.658952e-01 1.427335e-01 3.447146e-15
## 325 2.394066e-01 6.762450e-01 8.434838e-02 1.380862e-16
## 330 4.836940e-01 5.131962e-01 3.109807e-03 7.444364e-20
## 333 8.615764e-01 1.377343e-01 6.892534e-04 1.079480e-23
## 334 2.247274e-01 6.806355e-01 9.463703e-02 9.427207e-17
## 338 1.355479e-01 7.307107e-01 1.337414e-01 1.271498e-15
## 340 2.188496e-01 6.913376e-01 8.981275e-02 5.386374e-16
## 348 9.903906e-01 9.335274e-03 2.740760e-04 5.066698e-20
## 352 9.379721e-01 6.131054e-02 7.173858e-04 2.934268e-20
## 367 2.476981e-20 3.620420e-17 4.238775e-13 1.000000e+00
## 369 6.998706e-20 6.701983e-17 1.529547e-12 1.000000e+00
## 371 1.782481e-17 1.945155e-14 3.153721e-10 1.000000e+00
## 386 9.030824e-21 6.238381e-17 9.334043e-14 1.000000e+00
## 391 2.736276e-19 4.947836e-16 2.329180e-12 1.000000e+00
## 396 9.855294e-19 1.402757e-15 6.291107e-12 1.000000e+00
## 399 2.002490e-20 1.310909e-16 1.618393e-13 1.000000e+00
## 415 1.194680e-23 2.096343e-19 5.037787e-16 1.000000e+00
## 416 9.322605e-22 4.669631e-18 2.080823e-14 1.000000e+00
## 428 1.759625e-19 1.483625e-16 7.509318e-13 1.000000e+00
## 434 2.990078e-20 3.805701e-17 4.898208e-13 1.000000e+00
## 440 1.993891e-20 5.728044e-17 3.693258e-13 1.000000e+00
## 444 6.164393e-20 1.108200e-16 1.513367e-12 1.000000e+00
## 450 8.332681e-20 1.664108e-16 1.275593e-12 1.000000e+00
## 461 1.638814e-19 2.421737e-16 2.359750e-12 1.000000e+00
## 474 7.254837e-17 7.385543e-14 8.711083e-11 1.000000e+00
## 475 7.999821e-18 2.490216e-14 8.542058e-12 1.000000e+00
## 476 1.332341e-18 6.493563e-15 2.450574e-12 1.000000e+00
## 477 8.764637e-18 2.156858e-14 1.670989e-11 1.000000e+00
## 485 2.000381e-15 2.907725e-12 2.628919e-10 1.000000e+00
## 491 9.308589e-04 2.597028e-01 7.393663e-01 1.806241e-11
## 493 2.891066e-03 1.236607e-01 8.734482e-01 1.296464e-13
## 497 2.341412e-02 4.505605e-01 5.260254e-01 2.829887e-11
## 502 2.199584e-01 3.686441e-01 4.113975e-01 7.633848e-19
## 506 2.788546e-01 4.017031e-01 3.194423e-01 3.300769e-19
## 
## $x
##            LD1          LD2         LD3
## 7   -2.1153438 -0.367901917  0.60398591
## 16  -2.3676396 -0.375366624  0.09105350
## 17  -2.5827588 -0.067441975  0.08415944
## 23  -1.8808423 -0.754413526  0.51278552
## 26  -1.8373617 -0.651161446  0.48581655
## 27  -1.9391051 -0.816408300  0.37293279
## 35  -1.5927214 -0.772841977  0.38570442
## 40  -3.4366062  2.425925315 -2.34128404
## 54  -3.2526389  0.798851930  1.07525816
## 76  -2.5308783 -0.039954423  1.59767126
## 106 -1.4325477 -0.359090649  0.16896688
## 107 -1.4080716 -0.307145288  0.40495443
## 109 -1.6415585 -0.465655580 -0.22795932
## 110 -1.5644384 -0.370402164  0.14916532
## 112 -1.4832999 -0.536153224 -0.17413636
## 116 -1.1699358 -0.441314009  0.32965126
## 120 -1.4575458 -0.148514823  0.49848714
## ... (LD1-LD3 scores for the remaining test observations omitted for brevity)
## [1] 0.4215686
##           predicted
## correct    low med_low med_high high
##   low       11      11        4    0
##   med_low    5      12        7    0
##   med_high   0      15       17    1
##   high       0       0        0   19

The results show that the high and low classes are assigned to the right groups much more often than the med_high and med_low classes. Overall our error rate is still around 42 % (it varies with how the observations are divided into train and test data). This is pretty high and could be better.
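
As a note to self, the error rate can also be computed directly instead of being read off the cross-tabulation. A minimal sketch (the function name `error_rate` and the toy vectors are my own, not from the exercise):

```r
# Proportion of misclassified observations, given two equal-length factors
error_rate <- function(correct, predicted) {
  mean(correct != predicted)
}

# Toy example: one of four observations is misclassified
obs  <- factor(c("low", "low", "high", "high"))
pred <- factor(c("low", "high", "high", "high"))
error_rate(obs, pred)  # 0.25
```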

Reload the Boston dataset and standardise it

To get a better error rate we can standardise the data set before computing the distance matrices. * Scale all variables to have mean = 0 and standard deviation = 1
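
A quick sanity check (my own addition, not part of the exercise) that scaling really leaves every column with mean 0 and standard deviation 1; base R's `scale()` does the whole job:

```r
library(MASS)  # for the Boston data

data(Boston)
scaled_boston <- as.data.frame(scale(Boston))

# every column should now have (numerically) mean 0 and sd 1
max(abs(colMeans(scaled_boston)))   # practically 0
range(apply(scaled_boston, 2, sd))  # both 1
```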

# Bring data to R
library(MASS)
library(dplyr)
data(Boston)
# Standardise data
standardised_boston <- as.data.frame(Boston %>% mutate_all(~(scale(.) %>% as.vector)))

# Calculate distances

## Euclidean distance matrix
dist_eu <- dist(standardised_boston, method = "euclidean")

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(standardised_boston, method = "manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
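
The gap between the two summaries is easy to see on a toy pair of points: Euclidean is the straight-line distance, Manhattan the sum of coordinate-wise absolute differences (my own illustration):

```r
p <- c(0, 0)
q <- c(3, 4)

sqrt(sum((p - q)^2))  # Euclidean distance: 5
sum(abs(p - q))       # Manhattan distance: 7
```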
set.seed(13)

# K-means clustering, centers = 4 (note: run on the unscaled Boston data)
km <- kmeans(Boston, centers = 4)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

# Summary of clusters
summary(km$cluster)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   2.269   2.000   4.000
# K-means clustering, center 3
km <- kmeans(Boston, centers = 3)
# plot the Boston dataset with clusters
pairs(Boston[1:5], col = km$cluster)

pairs(Boston[6:10], col = km$cluster)

pairs(Boston[11:ncol(Boston)], col = km$cluster)

# Summary of clusters
summary(km$cluster)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.929   2.000   3.000
# K-means clustering, center 2
km <- kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston[1:5], col = km$cluster)

pairs(Boston[6:10], col = km$cluster)

pairs(Boston[11:ncol(Boston)], col = km$cluster)

# Summary of clusters
summary(km$cluster)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.729   2.000   2.000
# K-means clustering, center 5
km <- kmeans(Boston, centers = 5)
# plot the Boston dataset with clusters
pairs(Boston[1:5], col = km$cluster)

pairs(Boston[6:10], col = km$cluster)

pairs(Boston[11:ncol(Boston)], col = km$cluster)

# Summary of clusters
summary(km$cluster)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   3.000   2.666   3.000   5.000
# K-means clustering, center 6
km <- kmeans(Boston, centers = 6)
# plot the Boston dataset with clusters
pairs(Boston[1:5], col = km$cluster)

pairs(Boston[6:10], col = km$cluster)

pairs(Boston[11:ncol(Boston)], col = km$cluster)

# Summary of clusters
summary(km$cluster)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   5.000   4.312   6.000   6.000

From the summaries we can see that the two methods give quite different distances, the Manhattan method yielding the longer ones. Euclidean distance is the straight-line distance between two points, whereas Manhattan distance is the sum of the absolute differences along each coordinate. Two clusters seem to work best. With centers = 4 the cluster labels range from 1 to 4 and cluster 2 is the most common; when setting a higher value for centers, four clusters still seems the most common choice. So the summary alone is a poor way to select a good number of clusters. The plots suggest that four clusters could be present in the data, but at least two and three clusters should also be tried.
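
One common way to back up that choice is to look for the "elbow" in the total within-cluster sum of squares as k grows. A sketch (my own, assuming the standardised data is the more sensible input for k-means):

```r
library(MASS)

data(Boston)
set.seed(13)

# total WCSS for k = 1..6 on the standardised data
wcss <- sapply(1:6, function(k) kmeans(scale(Boston), centers = k)$tot.withinss)

# look for the k where the curve bends
plot(1:6, wcss, type = "b", xlab = "number of clusters", ylab = "total WCSS")
```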

Bonus: k-means on the standardised Boston data with 3 clusters

data(Boston)
# Standardise data
standardised_boston <- as.data.frame(Boston %>% mutate_all(~(scale(.) %>% as.vector)))

# K-means clustering, center 3
km <- kmeans(standardised_boston, centers = 3)
# plot the Boston dataset with clusters
pairs(standardised_boston[1:5], col = km$cluster)

pairs(standardised_boston[6:10], col = km$cluster)

pairs(standardised_boston[11:ncol(standardised_boston)], col = km$cluster)

# Summary of clusters
summary(km$cluster)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.743   2.000   3.000
standardised_boston$N_clusters <- factor(km$cluster)

# Build up test and train datasets
# choose randomly 80% of the rows
n <- nrow(standardised_boston)
ind <- sample(n, size = n * 0.8)

# create train set
train <- standardised_boston[ind,]
dim(train)
## [1] 404  15
# create test set 
test <- standardised_boston[-ind,]
dim(test)
## [1] 102  15
# save the 'correct' clusters from test data
correct_classes <- test$N_clusters

# remove the N_clusters variable from test data
test <- dplyr::select(test, -N_clusters)
head(test)
##          crim          zn      indus       chas        nox         rm
## 3  -0.4169290 -0.48724019 -0.5927944 -0.2723291 -0.7395304  1.2814456
## 8  -0.4032966  0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.1603069
## 10 -0.4003331  0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.3994130
## 11 -0.3939564  0.04872402 -0.4761823 -0.2723291 -0.2648919  0.1314594
## 21 -0.2745709 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -1.0171035
## 23 -0.2768170 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.2030044
##           age         dis        rad        tax    ptratio     black      lstat
## 3  -0.2655490 0.556609050 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324
## 8   0.9778406 1.023624897 -0.5224844 -0.5769480 -1.5037485 0.4406159  0.9097999
## 10  0.6154813 1.328320207 -0.5224844 -0.5769480 -1.5037485 0.3289995  0.6227277
## 11  0.9138948 1.211779950 -0.5224844 -0.5769480 -1.5037485 0.3926395  1.0918456
## 21  1.0488914 0.001356935 -0.6373311 -0.6006817  1.1753027 0.2179309  1.1716657
## 23  0.8215287 0.086363887 -0.6373311 -0.6006817  1.1753027 0.4406159  0.8495847
##          medv
## 3   1.3229375
## 8   0.4965904
## 10 -0.3949946
## 11 -0.8190411
## 21 -0.9712629
## 23 -0.7972951
# linear discriminant analysis
lda.fit <- lda(N_clusters ~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(N_clusters ~ ., data = train)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.4702970 0.3267327 0.2029703 
## 
## Group means:
##         crim         zn      indus        chas        nox          rm
## 1 -0.3724278 -0.3387499 -0.2829898  0.03849464 -0.3098577 -0.07912173
## 2  0.7707649 -0.4872402  1.1304030  0.05576262  1.1890969 -0.51987167
## 3 -0.4082840  1.5687708 -1.1015857 -0.08027540 -1.0112004  0.88743088
##           age         dis        rad        tax    ptratio      black
## 1  0.02723655  0.01214864 -0.5786988 -0.6052722 -0.1000423  0.2679608
## 2  0.82255145 -0.85370946  1.1680242  1.2695565  0.5468315 -0.5919408
## 3 -1.17655926  1.26178915 -0.6037174 -0.6590027 -0.7421679  0.3540631
##        lstat        medv
## 1 -0.1783253  0.06127032
## 2  0.8633549 -0.69424886
## 3 -0.9375719  0.95206252
## 
## Coefficients of linear discriminants:
##                 LD1          LD2
## crim    -0.05143915  0.128168802
## zn      -0.05754639  1.091843941
## indus    0.49017626  0.071189895
## chas     0.04698341 -0.068524065
## nox      1.06863590  0.719733618
## rm      -0.15043419  0.380911844
## age     -0.12108442 -0.593230736
## dis     -0.12373888  0.564951427
## rad      0.69328145  0.007068937
## tax      0.92825702  0.816155360
## ptratio  0.24143922  0.085061886
## black    0.01470321 -0.013028639
## lstat    0.17568563  0.484228519
## medv    -0.06969142  0.626189685
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8378 0.1622
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$N_clusters)

# plot the lda results
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)

The plot of the LDA results shows that nox, tax and rad have the largest (absolute) coefficients on LD1, while zn (followed by tax and nox) dominates LD2, so these variables are the most influential linear separators for the clusters.
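
The same reading can be checked numerically by ranking variables by the absolute size of their coefficients. A sketch with a small stand-in matrix in place of `lda.fit$scaling` (the values below are illustrative, not the fitted ones):

```r
# stand-in for lda.fit$scaling: rows are variables, columns discriminants
coefs <- cbind(LD1 = c(crim = -0.05, nox = 1.07, tax = 0.93, rad = 0.69),
               LD2 = c(crim =  0.13, nox = 0.72, tax = 0.82, rad = 0.01))

# variables ordered from most to least influential on each discriminant
apply(coefs, 2, function(v) names(sort(abs(v), decreasing = TRUE)))
```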

SuperBonus

# Add a crime factor to the data set scaled in the last code chunk

# create a quantile vector of crim and print it
bins <- quantile(standardised_boston[['crim']])
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(standardised_boston[['crim']], breaks = bins, include.lowest = TRUE)

# remove original crim from the dataset
standardised_boston <- dplyr::select(standardised_boston, -crim)

# add the new categorical value to scaled data
standardised_boston <- data.frame(standardised_boston, crime)

# Set crime to character for a while
standardised_boston$crime <- as.character(standardised_boston$crime)

# Rename the quantile intervals to descriptive class labels
standardised_boston$crime[standardised_boston$crime=="[-0.419,-0.411]"] <- "low"
standardised_boston$crime[standardised_boston$crime=="(-0.411,-0.39]"] <- "med_low"
standardised_boston$crime[standardised_boston$crime=="(-0.39,0.00739]"] <- "med_high"
standardised_boston$crime[standardised_boston$crime=="(0.00739,9.92]"] <- "high"
# back to factor with correct levels
standardised_boston$crime<- factor(standardised_boston$crime, levels = c("low", "med_low", "med_high", "high"))
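
As a note to self, the manual character round-trip above can be skipped: `cut()` takes a `labels` argument that names the intervals directly. A self-contained toy (the vector `x` is my own, not the crim column):

```r
x <- c(-0.42, -0.40, 0.0, 5.0)
bins <- quantile(x)

crime <- cut(x, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
crime  # low med_low med_high high
```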

# Check head of edited table
head(standardised_boston)
##           zn      indus       chas        nox        rm        age      dis
## 1  0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948 0.140075
## 2 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824  0.3668034 0.556609
## 3 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490 0.556609
## 4 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878 1.076671
## 5 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743 1.076671
## 6 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100 1.076671
##          rad        tax    ptratio     black      lstat       medv N_clusters
## 1 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990  0.1595278          1
## 2 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239          1
## 3 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324  1.3229375          1
## 4 -0.7521778 -1.1050216  0.1129203 0.4157514 -1.3601708  1.1815886          3
## 5 -0.7521778 -1.1050216  0.1129203 0.4406159 -1.0254866  1.4860323          3
## 6 -0.7521778 -1.1050216  0.1129203 0.4101651 -1.0422909  0.6705582          1
##   crime
## 1   low
## 2   low
## 3   low
## 4   low
## 5   low
## 6   low
# Build up test and train datasets

# choose randomly 80% of the rows
n <- nrow(standardised_boston)
ind <- sample(n, size = n * 0.8)

# create train set
train <- standardised_boston[ind,]
dim(train)
## [1] 404  15
# create test set
test <- standardised_boston[-ind,]
dim(test)
## [1] 102  15
# Save the N_clusters column
n_clusterit <- train$N_clusters

# remove the original N_clusters from the dataset
train <- dplyr::select(train, -N_clusters)

# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2500000 0.2500000 0.2574257 0.2425743 
## 
## Group means:
##                   zn      indus        chas        nox         rm        age
## low       1.08456771 -0.9369443 -0.11640431 -0.8924742  0.5212608 -0.8766213
## med_low  -0.09115733 -0.3312543  0.03952046 -0.5917217 -0.1115500 -0.3625932
## med_high -0.39571399  0.1510421  0.14409500  0.3846657  0.1429959  0.4294170
## high     -0.48724019  1.0149946 -0.03128211  1.0406720 -0.4855777  0.7957547
##                 dis        rad        tax     ptratio      black       lstat
## low       0.8529692 -0.6987343 -0.7602375 -0.49441664  0.3799540 -0.79526046
## med_low   0.3642410 -0.5554602 -0.5409951 -0.04120056  0.3201782 -0.17644351
## med_high -0.3749755 -0.3822389 -0.2825025 -0.26326606  0.1074278 -0.05752024
## high     -0.8584374  1.6596029  1.5294129  0.80577843 -0.7464259  0.89613928
##                medv
## low       0.5929402
## med_low   0.0102126
## med_high  0.2096063
## high     -0.6231054
## 
## Coefficients of linear discriminants:
##                 LD1         LD2        LD3
## zn       0.11030024  0.90477115 -0.9432520
## indus   -0.02369693 -0.21284527  0.3534688
## chas    -0.06681638 -0.04469796  0.1555483
## nox      0.32271479 -0.76681432 -1.3321508
## rm      -0.11928537 -0.08149308 -0.1426609
## age      0.35340567 -0.29789983 -0.2669457
## dis     -0.08563196 -0.47438844  0.2270172
## rad      3.09835255  1.17955576  0.1675980
## tax      0.10830898 -0.36338770  0.2630112
## ptratio  0.15263545  0.06163493 -0.2299144
## black   -0.17731254  0.01064261  0.1035738
## lstat    0.14115142 -0.15376102  0.3960080
## medv     0.18691351 -0.36944033 -0.1857561
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9436 0.0416 0.0148
# Collect crime variables
target<- factor(train$crime)


# Collect predictor data
model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
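
Up to centring, this matrix product is exactly what `predict()` computes internally: the discriminant scores. A self-contained check on the iris data (a stand-in, since the chunk above depends on the earlier train/test split); note that `predict.lda` centres the data before projecting:

```r
library(MASS)

fit <- lda(Species ~ ., data = iris)
X <- as.matrix(iris[, 1:4])

# centre on the overall mean (equal group sizes make this the LDA centring too)
scores <- scale(X, center = TRUE, scale = FALSE) %*% fit$scaling

all.equal(unname(scores), unname(predict(fit)$x))  # TRUE
```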

# Next, access the plotly package. Create a 3D plot of the columns of the matrix.
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=target)
# Colour-code by the k-means clusters instead
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=n_clusterit)
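
To close the loop, the crime model could also be scored on the held-out rows, just like in the main analysis. A self-contained sketch of the whole pipeline, condensed into one chunk (the seed and 80/20 split follow the chunks above):

```r
library(MASS)

data(Boston)
sb <- as.data.frame(scale(Boston))

# quantile-based crime classes, then drop the continuous crim
sb$crime <- cut(sb$crim, breaks = quantile(sb$crim), include.lowest = TRUE,
                labels = c("low", "med_low", "med_high", "high"))
sb$crim <- NULL

# 80/20 train/test split
set.seed(13)
n <- nrow(sb)
ind <- sample(n, size = round(n * 0.8))
train <- sb[ind, ]
test <- sb[-ind, ]

# fit on train, predict on test
fit <- lda(crime ~ ., data = train)
pred <- predict(fit, newdata = test)

table(correct = test$crime, predicted = pred$class)
mean(test$crime != pred$class)  # held-out error rate
```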

(more chapters will be added in the same fashion as the course proceeds!)